A machine learning approach for identification of gastrointestinal predictors for the risk of COVID-19 related hospitalization

Background and aim COVID-19 can present with various gastrointestinal symptoms. Shortly after the pandemic outbreak, several machine learning algorithms were implemented to assess new diagnostic and therapeutic methods for this disease. The aim of this study is to assess gastrointestinal and liver-related predictive factors for the SARS-CoV-2 associated risk of hospitalization. Methods Data collection was based on a questionnaire from the COVID-19 outpatient test center and from the emergency department at the University Hospital, in combination with data from the internal hospital information system and from a mobile application used for telemedicine follow-up of patients. For statistical analysis, SARS-CoV-2 negative patients were considered as controls for three different SARS-CoV-2 positive patient groups (divided based on severity of the disease). The data were visualized and analyzed in R version 4.0.5. The Chi-squared or Fisher test was applied to test the null hypothesis of independence between the factors, followed, where appropriate, by multiple comparisons with the Benjamini Hochberg adjustment. The null hypothesis of the equality of the population medians of a continuous variable was tested by the Kruskal Wallis test, followed by the Dunn multiple comparisons test. In order to assess the predictive power of the gastrointestinal parameters and other measured variables for predicting the outcome of the patient group, the Random Forest machine learning algorithm was trained on the data. The predictive ability was quantified by the ROC curve, constructed from the Out-of-Bag data. The Matthews correlation coefficient was used as a one-number summary of the quality of binary classification. The importance of the predictors was measured using the Variable Importance. A 2D representation of the data was obtained by means of Principal Component Analysis for mixed type of data. Findings with a p-value below 0.05 were considered statistically significant.
Results A total of 710 patients were enrolled in the study. The presence of diarrhea and nausea was significantly higher in the emergency department group than in the COVID-19 outpatient test center group. Among liver enzymes, only aspartate transaminase (AST) was significantly elevated in the hospitalized group compared to patients discharged home. Based on the Random Forest algorithm, AST was identified as the most important predictor, followed by age and diabetes mellitus. Diarrhea and bloating also have predictive importance, although much lower than AST. Conclusion SARS-CoV-2 positivity is connected with isolated AST elevation, and the level is linked with the severity of the disease. Furthermore, using the machine learning Random Forest algorithm, we have identified elevated AST as the most important predictor for COVID-19 related hospitalizations.


INTRODUCTION
Acute SARS-CoV-2 infection presents with variable symptoms associated with various organ systems. Typical symptoms of COVID-19 are fever and cough; in a more severe course of the disease, dyspnea with respiratory insufficiency occurs (Guan et al., 2020). In addition, COVID-19 may present with gastrointestinal symptoms, predominantly nausea, vomiting, diarrhea, anorexia and abdominal pain, with a relatively wide range of prevalence among published studies (Aziz et al., 2020; Mao et al., 2020; Sultan et al., 2020; Patel et al., 2020; D'Amico et al., 2020; Jin et al., 2020; Xiao et al., 2020). Since the COVID-19 pandemic is the cause of an immense world health crisis, new diagnostic and therapeutic methods are rapidly emerging (Alimadadi et al., 2020). The use of artificial intelligence is just one of them. Shortly after the COVID-19 outbreak, various machine learning algorithms were implemented (Randhawa et al., 2020; Yan et al., 2020; Ge et al., 2021; Li et al., 2020). Machine learning helps to quickly identify patterns and trends in large volumes of data that are difficult for humans to recognize (Kushwaha et al., 2020). The availability of objective stratification tools for the rapid assessment of a patient's status and prognosis is of great use for frontline health-care providers (Bachtiger, Peters & Walsh, 2020).
The primary aim of this study is to assess possible predictive factors for SARS-CoV-2 outcome based on gastrointestinal symptoms and liver-related laboratory results using the Random Forest machine learning algorithm (Guan et al., 2020; Breiman, 2001). The secondary aim is to determine the prevalence of gastrointestinal symptoms among patients with COVID-19 within different groups based on the severity of the disease.

MATERIALS AND METHODS
The study was performed from February through May 2021. Only subjects aged 18 years or older were included in the study. All patients enrolled in this study signed informed consent.
This study was approved by the Ethical Committee of the University Hospital in Martin, decision number: 14/2021. Two distinct populations were considered for this study. The first group consisted of patients who underwent a nasopharyngeal swab in the outpatient hospital testing center for COVID-19 in order to determine whether they were SARS-CoV-2 positive. The method used for SARS-CoV-2 detection from the nasopharyngeal swab was PCR (polymerase chain reaction). This group was then subdivided based on test positivity. The negative group was thereafter used as the control group for this study.
The second group consisted of patients who attended the COVID-19 emergency department (ED) in the University Hospital. These patients were confirmed positive from a nasopharyngeal swab by either PCR or an antigen test. Only patients with typical COVID-19 symptoms (fever, cough, dyspnea) were included in this study. Patients who were SARS-CoV-2 positive but did not present with typical COVID-19 symptoms (e.g., patients who came to the emergency room because of other diagnoses but were simultaneously SARS-CoV-2 positive) were excluded. Therefore, we considered for this study only patients who both tested positive and had at least one typical COVID-19 symptom.
The second group was then divided based on further evaluation and the course of the disease. The first subgroup consisted of patients who did not require admission to the hospital and were referred to outpatient care. The second subgroup consisted of patients who were admitted to the hospital. This subgroup was observed until the end of hospitalization, which ended either in death or in resolution of the disease. For analysis purposes, this subgroup was further divided into patients who required medical care in a standard hospital ward and those who needed the intensive care unit (ICU).
In the group from the COVID-19 outpatient test center at the University Hospital, data were collected using a questionnaire. Data from the emergency room were obtained with the same questionnaire, combined with information from the medical examination by an attending physician and from the mobile application MEDAsistent, used for telemedicine follow-up and developed at the Clinic of Pneumology and Phthisiology in the University Hospital in Martin. Further information about hospitalized patients (including laboratory test results, chest X-ray etc.) was obtained from the hospital information system.
The questionnaire consisted of questions on current health complaints typical for COVID-19 and on the spectrum of the most common gastrointestinal symptoms occurring in the last 5-7 days before examination. Patients could also write down other symptoms that were not in the original list. In order to include only new or worsened gastrointestinal symptoms in the study, the questionnaire also contained questions about chronic gastrointestinal symptoms and their possible worsening in the last 5-7 days before examination.

Data analysis
The data were visualized and analyzed in R (R Development Core Team, 2021), version 4.0.5, with the aid of the libraries gtsummary (Sjoberg et al., 2020), rstatix (Kassambara, 2021), DescTools (Signorell, 2021), randomForestSRC (Ishwaran & Kogalur, 2021), PCAmixdata (Chavent et al., 2017) and ggpubr (Kassambara, 2020). The sample median and the lower and upper quartiles were used to summarize the data on continuous variables (e.g., age); counts and percentages were used to summarize factors (e.g., gender). The Chi-squared or Fisher test was applied to test the null hypothesis of independence between factors (gender vs group; fever vs group; cough vs group; diarrhea vs group; constipation vs group; bloating vs group; nausea vs group; heartburn vs group; abdominal pain vs group), followed, where appropriate, by multiple comparisons with the Benjamini Hochberg adjustment. Using a contingency table, the absence of a trend was tested by the Cochran Armitage test. The null hypothesis of the equality of the population medians of the continuous variables (age, oxygen (O2) saturation, C-reactive protein (CRP), gamma glutamyltransferase (GMT), aspartate aminotransferase (AST), bilirubin) was tested by the Kruskal Wallis test, followed by the Dunn multiple comparisons test with the Benjamini Hochberg correction of p-values. Two-way ANOVA was used to model the association between AST and group (discharged home, admitted to hospital) in interaction with recent antibiotic (ATB) use. Another two-way ANOVA model was used to quantify the association between AST and group (discharged home, admitted to hospital) in interaction with chronic liver disease (yes, no). The AST values were log-transformed to bring the data closer to normality. Normality of residuals was assessed by the quantile-quantile plot with the 95% confidence band constructed by bootstrap. The assumption of homogeneity of variance was tested by the Levene test.
In order to assess the predictive power of the gastrointestinal parameters and other measured variables (Gender, Age, Number of Days of Symptoms, AST, alanine aminotransferase /ALT/, Bilirubin, Recent antibiotics /ATB/ Usage, Diabetes Mellitus, Arterial Hypertension, Chronic Liver Disease, Fever, Cough, Diarrhea, Constipation, Bloating, Nausea, Heartburn, Abdominal Pain) for predicting the outcome of the patient group, the Random Forest machine learning algorithm was trained on the data. The predictive ability was quantified by the ROC curve, constructed from the Out-of-Bag data. The Matthews correlation coefficient was used as a one-number summary of the quality of binary classification. Importance of the predictors was measured by the Variable Importance. A 2D representation of the data (the predictors used in Random Forest; i.e., Gender, Age, Number of Days of Symptoms, AST, ALT, Bilirubin, Recent ATB Usage, Diabetes Mellitus, Arterial Hypertension, Chronic Liver Disease, Fever, Cough, Diarrhea, Constipation, Bloating, Nausea, Heartburn, Abdominal Pain) was obtained by Principal Component Analysis for a mixed type of data. Findings with a p-value below 0.05 were considered statistically significant.
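The Benjamini Hochberg adjustment mentioned above can be illustrated with a minimal sketch. It is written in Python for illustration only (the study itself used R), the function name benjamini_hochberg is hypothetical, and the p-values are made-up numbers, not results from this study.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Visit the p-values from largest to smallest.
    order = sorted(range(m), key=lambda i: p_values[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for position, idx in enumerate(order):
        rank = m - position  # rank of this p-value in ascending order
        candidate = p_values[idx] * m / rank
        # Enforce monotonicity: an adjusted value never exceeds that of a larger p-value.
        running_min = min(running_min, candidate)
        adjusted[idx] = running_min
    return adjusted


# Illustrative (hypothetical) raw p-values from four pairwise comparisons.
raw = [0.01, 0.04, 0.03, 0.20]
adj = benjamini_hochberg(raw)
significant = [p < 0.05 for p in adj]
```

With these illustrative inputs, only the smallest raw p-value stays below 0.05 after adjustment, which shows why a comparison that looks significant in isolation may not survive the correction.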
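As a sketch of how the Matthews correlation coefficient condenses a binary confusion matrix into one number, the following Python snippet computes it from the four cell counts. The function name and the counts are hypothetical and chosen only for easy arithmetic; they are not the confusion matrix of this study.

```python
import math


def matthews_corrcoef(tp, fp, fn, tn):
    """Matthews correlation coefficient from the cells of a 2x2 confusion matrix."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any margin of the table is empty.
    return numerator / denominator if denominator else 0.0


# Hypothetical counts: 40 true positives, 10 false positives,
# 10 false negatives, 40 true negatives.
mcc = matthews_corrcoef(tp=40, fp=10, fn=10, tn=40)
perfect = matthews_corrcoef(tp=5, fp=0, fn=0, tn=5)
```

The coefficient ranges from -1 (total disagreement) through 0 (no better than chance) to 1 (perfect classification), and unlike plain accuracy it uses all four cells of the table.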

RESULTS
A total of 710 patients were enrolled in the study. Thirty (30) patients were excluded from further analysis after primary screening. Participants (n = 352) from the outpatient center who tested PCR negative for the SARS-CoV-2 virus were considered as the control group. The SARS-CoV-2 positive group from the outpatient center included 166 participants. One hundred and sixty-two (n = 162) patients from the emergency department were enrolled. From this group, 78 patients (48%) were discharged home and 57 (35.3%) were admitted to the hospital for standard care until discharged from hospital. Twenty-seven (27) (16.7%) patients required the intensive care unit. The two groups from the outpatient center had similar median ages of 42 and 41 years, respectively. Hospitalized patients were significantly older, as shown in Table 1. The presence of typical COVID-19 symptoms such as fever and cough was significantly higher in the hospitalized groups than in outpatient participants. There were no significant differences between groups in the men to women ratio.

Gastrointestinal symptoms occurrence and laboratory findings (Tables 1 and 2)
The presence of diarrhea, constipation, bloating, nausea, heartburn and abdominal pain was considered in this study. The presence of diarrhea and nausea was significantly higher in SARS-CoV-2 positive patients than in SARS-CoV-2 negative controls. Comparing SARS-CoV-2 negative and SARS-CoV-2 positive participants, the cumulative presence of diarrhea was 21.3% (70/328) in the positive group (combined outpatient center and emergency department) vs 6.2% (22/352) in the negative group, and for nausea it was 13.1% (43/328) in the positive group vs 3.4% (12/352) in the negative group. This trend continues when considering ED patients and the severity of disease.
Among gastrointestinal symptoms, diarrhea and bloating were significantly more frequent in patients who were admitted to the hospital than in those discharged home (40% for diarrhea and 14% for bloating vs 18% and 2.6%, respectively). Other symptoms such as abdominal pain, heartburn, nausea, vomiting, anorexia, and constipation did not differ significantly between these groups. C-reactive protein was also significantly higher in the hospitalized group. Of alanine transaminase (ALT), aspartate transaminase (AST) and bilirubin as markers of possible liver damage, only AST (Fig. 1) was significantly higher in the hospitalized group. This difference is substantial. There is no statistically significant difference in the levels of ALT (Fig. 1) and bilirubin when comparing different groups of patients.
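The pooled diarrhea counts quoted above (70/328 vs 22/352) can be checked with a small worked example. The sketch below, in Python rather than the R used in the study, computes the prevalences and an uncorrected Pearson chi-squared test on the resulting 2x2 table. It is an illustration under stated assumptions, not the authors' analysis: the study compared several groups and applied multiplicity adjustment, and the helper name chi2_2x2 is hypothetical.

```python
import math


def chi2_2x2(a, b, c, d):
    """Pearson chi-squared statistic (no continuity correction) and p-value
    for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom, the upper tail probability is erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Counts taken from the text: diarrhea in positive vs negative participants.
diarrhea_pos, n_pos = 70, 328
diarrhea_neg, n_neg = 22, 352

prevalence_pos = 100 * diarrhea_pos / n_pos  # percent
prevalence_neg = 100 * diarrhea_neg / n_neg  # percent

stat, p_value = chi2_2x2(
    diarrhea_pos, n_pos - diarrhea_pos,
    diarrhea_neg, n_neg - diarrhea_neg,
)
```

The closed-form expression for the statistic is specific to 2x2 tables; larger contingency tables, such as symptom vs all patient groups, need the general sum over observed and expected counts.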

Predictors of hospitalization based on machine learning
Based on the Random Forest algorithm trained on the data on demographic characteristics, symptoms and gastrointestinal related laboratory findings in hospitalized and discharged patients, several predictors for the risk of hospitalization were identified. AST was identified as the most important predictor, followed by age and diabetes mellitus. Diarrhea and bloating also have positive importance, although much lower than AST. Gastrointestinal symptoms such as nausea, abdominal pain or anorexia have no or negative predictive importance. The ROC curve for the combined factors is shown in Fig. 2, with AUC 0.76. The Matthews correlation coefficient was 0.48.
When using only liver enzymes (AST, ALT), gastrointestinal symptoms (diarrhea and bloating), chronic liver disease, age and diabetes mellitus, the ROC curve (Fig. 3) for this combination of factors attained AUC 0.799 with AST as the strongest predictor for hospitalization (Table 3). The Matthews correlation coefficient was 0.37.
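To make the AUC values above easier to interpret, the following Python sketch computes the area under the ROC curve directly from its probabilistic meaning: the chance that a randomly chosen hospitalized patient receives a higher predicted score than a randomly chosen discharged patient, with ties counted as one half. The function name and the scores are hypothetical and are not predictions from the study's Random Forest model.

```python
def auc_from_scores(scores_pos, scores_neg):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0   # positive case ranked above negative case
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical predicted probabilities of hospitalization.
admitted = [0.9, 0.8, 0.4]
discharged = [0.7, 0.3, 0.2]
auc = auc_from_scores(admitted, discharged)
```

Under this reading, an AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two groups.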

Principal components visualization of data
Principal component analysis was used to obtain a two-dimensional visualization of the data for patients discharged home after ED examination and patients admitted to hospital. The data used for the analysis are those from Table 2, that is, a combination of general patient characteristics, typical COVID-19 symptoms, gastrointestinal symptoms and liver related laboratory results. The PCA plot (Fig. 4) shows two distinct clusters which partially overlap, with a tendency to shift apart.

DISCUSSION
Several studies and meta-analyses have pointed out the gastrointestinal involvement in the SARS-CoV-2 infection (Mao et al., 2020; Sultan et al., 2020; D'Amico et al., 2020; Xiao et al., 2020; Pan et al., 2020; Villapol, 2020; Galanopoulos et al., 2020). The pooled prevalence of gastrointestinal symptoms varies considerably between studies, from 10.5% to 53% (Mao et al., 2020; Sultan et al., 2020; Pan et al., 2020; Ashktorab et al., 2021). Based on the comprehensive meta-analysis by Sultan et al. (2020), the pooled prevalence of diarrhea is 7.7%, nausea and vomiting 7.8% and abdominal pain 3.6%. In the presented study we have focused on the presence of diarrhea, constipation, bloating, nausea, heartburn and abdominal pain. Statistically significant differences were found for diarrhea and nausea when comparing SARS-CoV-2 negative and positive patients. In the group of hospitalized patients (with standard care), diarrhea was present in 40% of patients and nausea in 21%, which is higher than in some of the meta-analyses mentioned, but consistent with the data on the general presence of gastrointestinal symptoms and gut involvement. Within the emergency department group, the presence of bloating is significantly higher in hospitalized patients than in those who were discharged home. Interestingly, bloating has a lower prevalence in the group of ICU patients than in patients with standard care management. This could be explained by the high subjectivity and interpersonal differences in reporting a symptom such as bloating. Patients with a more severe course of disease may attach lower importance to less bothersome symptoms such as heartburn, bloating and nausea than to more pronounced symptoms such as diarrhea, abdominal pain or vomiting.
Focusing on the liver enzymes as markers of possible liver impairment resulting from SARS-CoV-2 infection, AST, ALT and bilirubin were considered for the evaluation. The results show that the median level of liver enzymes was not elevated in the discharged group. Bilirubin and ALT were also within the normal range in the hospitalized group, with no statistically significant differences between these two groups. Only AST was elevated above the upper limit of the reference range in the hospitalized group, with progressively higher values in patients who required ICU. The differences between hospitalized and discharged patients are statistically significant. Several previously published studies have shown an elevation in both transaminases and bilirubin to a different extent, ranging from 1% to 53% (mainly ALT and AST accompanied by slightly increased bilirubin concentrations) (Mao et al., 2020). In most published data, severe liver alterations were uncommon (Marasco et al., 2021) and the pooled prevalence of liver injury regarding severity was 12% based on the meta-analysis by Mao et al. (2020). More severe liver injury was also associated with worse outcomes, including intensive care unit admission and mortality (Phipps et al., 2020).
The pathophysiology of liver involvement in COVID-19 is still not completely understood. Direct viral infection of the liver cells is proposed as one of the potential causes of liver injury, but comprehensive studies are scarce. A study with pathological analysis of liver tissues from deceased COVID-19 patients showed no viral inclusions in hepatocytes. Another repeatedly proposed and generally accepted mechanism of liver impairment could be drug toxicity (Mao et al., 2020). In order to determine the possible influence of recent ATB usage on the elevation of AST presented in this paper, a two-way analysis of variance (two-way ANOVA) was performed. There were no significant differences between the groups with and without recent antibiotic usage. Therefore, we have concluded that ATB usage has no relevant influence on the elevated AST levels. The two-way ANOVA was also performed to assess the relationship between the presence of chronic liver disease and AST. There was no statistically significant difference in AST levels between hospitalized patients with and without chronic liver disease. Another possible explanation of elevated transaminases is that they could be the result of a systemic inflammation. ALT is an enzyme most commonly found in the liver, with small levels in striated muscle tissue and myocardium. AST, on the other hand, is found in the liver, but also in striated and myocardial muscle, kidneys, brain and red blood cells. AST was used as a marker for myocardial infarction for a long time before more sensitive markers were identified and implemented into routine clinical practice (Ndrepepa, 2021). Based on the results of this study and current knowledge of the interaction of SARS-CoV-2 with the human organism, it is possible that elevated levels of AST in COVID-19 patients are the result of a systemic inflammation with general tissue hypoperfusion rather than of a direct influence of the SARS-CoV-2 virus on the hepatocytes or of hepatotoxic drug use.
Further, we focused on identifying the possible predicting factors for hospitalization in COVID-19 patients using the Random Forest (RF) machine learning algorithm.
Different types of machine learning are being used at an increasing rate to determine the predictors of outcome in various areas of clinical practice, from brain trauma injuries (Hanko et al., 2021), radiology (Choy et al., 2018) and oncology (Cruz & Wishart, 2017) to dermatology (Rajkomar, Dean & Kohane, 2019). Since the COVID-19 pandemic has been affecting the global population for more than two years now and is the cause of an immense health crisis in most countries of the world, new diagnostic tools, machine learning being one of them, and therapeutic methods have been rapidly emerging (Alimadadi et al., 2020). Shortly after the COVID-19 outbreak various machine learning techniques were used, including taxonomic classification of COVID-19 genomes (Randhawa et al., 2020), determining the predictors of severe COVID-19 and searching for new potential drug candidates against SARS-CoV-2 viral infection (Ge et al., 2021). Another example of a successful implementation of artificial intelligence in COVID-19 diagnosis is the evaluation of CT scans to detect SARS-CoV-2 associated pneumonia and differentiate it from community acquired pneumonia and other similar conditions, with specificity and sensitivity higher than 90%.
So far, several studies have been published using the Random Forest machine learning algorithm for identifying the predictors of COVID-19 outcome from a wide variety of symptoms, socioeconomic factors (Wollenstein-Betech et al., 2020) and laboratory results, with various results (Iwendi et al., 2020; Jie et al., 2020). To our current knowledge, there are no studies to date specifically focused on the gastrointestinal symptoms and gut related laboratory findings.
In order to assess the predictive power of the gastrointestinal parameters and other measured variables for predicting the need for hospitalization the Random Forest machine learning algorithm was trained on the data from our study. Random Forest has become the Machine Learning method of choice for several reasons: (a) it usually appears among (c) it does not overfit; (d) and last but not least, by its construction it provides a realistic estimate of the performance on a future data via the Out-Of-Bag data. Moreover, Random Forest, at least as implemented in the R library randomForestSRC, provides two different measures of importance of predictors. For these reasons, we have selected RF algorithm to assess predictive power of the studied variables, and to obtain their ranking.
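The Out-Of-Bag idea can be made concrete with a short worked calculation, sketched here in Python. It assumes the classical bootstrap with replacement from Breiman's original formulation; the sampling scheme actually used by the R library may differ, so the numbers are illustrative rather than a description of the study's configuration. The sample size of 162 is the number of emergency department patients reported in the Results, and the function name is hypothetical.

```python
import math


def oob_fraction(n):
    """Probability that a given observation is absent from a bootstrap
    sample of size n drawn with replacement from n observations."""
    return (1.0 - 1.0 / n) ** n


n_ed = 78 + 57 + 27         # ED patients reported in the Results
frac = oob_fraction(n_ed)   # expected share of patients left out of each tree
limit = math.exp(-1)        # large-sample limit, roughly 0.368
```

Under this assumption, each tree is grown without roughly one third of the patients, and those left-out patients serve as a built-in test set for that tree, which is why the Out-Of-Bag ROC curve does not require a separate hold-out sample.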
Results were plotted as a ROC curve obtained from the Out-Of-Bag data. When considering the general COVID-19 symptoms, gastrointestinal symptoms, age, sex, duration of the symptoms and comorbidities (diabetes mellitus, arterial hypertension and chronic liver diseases), the AUC is 0.76. The variable importance plot is shown in Fig. 5. When measuring the variable importance, the most important predictor is AST, followed by age and diabetes mellitus, which are substantially less important. When using only liver enzymes (AST, ALT), gastrointestinal symptoms (diarrhea and bloating), age and the presence of chronic liver disease and diabetes mellitus, the AUC is 0.799 with AST as the strongest predictor for hospitalization. The variable importance plot is shown in Fig. 6. Previously published studies, which used mostly the methods of classical statistics, have identified the presence of gastrointestinal symptoms, predominantly diarrhea (Aumpan, Nunanan & Vilaichone, 2020; Ghoshal et al., 2020), and elevated liver enzymes (Aziz et al., 2020) as predictors of hospitalization associated with COVID-19. In our data, aspartate transaminase (AST) was not only the one liver enzyme that was statistically significantly elevated in patients requiring hospitalization; using the Random Forest algorithm, it also proved to be the most important predictor of hospitalization. Finally, we performed the principal component analysis for mixed type of data in order to obtain a two-dimensional representation of the data on patients who were discharged home and those who were admitted to hospital. As can be seen in Fig. 4, these two groups partially overlap, but with a clear tendency to shift apart, which is in accordance with the predictive performance of the studied variables in the Random Forest algorithm.
Figure 6 Variable importance plot for selected factors. Variable importance plot for selected factors that are fast and easy to measure in the emergency department setting (liver enzymes: AST and ALT, gastrointestinal symptoms /diarrhea and bloating/, age and presence of chronic liver disease and diabetes mellitus). A positive value of importance of a predictor indicates that the predictor contributes to the predictive accuracy of the Random Forest algorithm. A negative value of importance of a predictor indicates that omitting the predictor increases the predictive accuracy of the Random Forest algorithm. DOI: 10.7717/peerj.13124/fig-6

CONCLUSIONS
Using the machine learning Random Forest algorithm, this study has identified elevated AST as the most important predictor for COVID-19 related hospitalizations.
We have also shown that SARS-CoV-2 positivity is connected with isolated AST elevation and that the level is linked with the severity of the disease. Furthermore, the prevalence of diarrhea and nausea among SARS-CoV-2 positive patients is significantly higher than among SARS-CoV-2 negative controls. Bloating occurs significantly more frequently in COVID-19 patients who require hospitalization than in those who can be discharged to outpatient care.

ADDITIONAL INFORMATION AND DECLARATIONS
Funding
This publication has been produced with the support of: The Integrated Infrastructure Operational Program for the project: Research and development of telemedicine solutions to support the fight against pandemic diseases induced COVID-19 and reducing its negative consequences by monitoring the health status of people in order to eliminate the risk of infection in at-risk populations, ITMS: 313011ASY8, co-financed by the European